In this report we are working with data compiled by WHO on the life expectancy on multiple countries across the world. Life expectancy has been steadily increasing over the past few decades as technology and countries advance, but by analyzing this data we can get a much more granular idea of exactly which factors affect the Life Expectancy of a country.
Dataset being used: https://www.kaggle.com/kumarajarshi/life-expectancy-who
library(readr)
lifexdata <- read.csv("Life Expectancy Data.csv")
df <- data.frame(lifexdata)
df <- na.omit(df) # if we don't do this, the different models will show below will have different numbers of rows, preventing us from comparing them directly through ANOVA and other evaluation criteria
We can see the variables we are working with: Country, Year, Status, Life.expectancy, Adult.Mortality, infant.deaths, Alcohol, percentage.expenditure, Hepatitis.B, Measles, BMI, under.five.deaths, Polio, Total.expenditure, Diphtheria, HIV.AIDS, GDP, Population, thinness..1.19.years, thinness.5.9.years, Income.composition.of.resources, Schooling
293822Life expectancy Type: NumericWe have 3 contenders to be Factor Variables:
| Predictor | No. of Levels |
|---|---|
| Country | 193 |
| Status | 2 |
| Year | 16 |
We can immediately eliminate Country from our data frame. It's size would make our model too large and unwieldy. Similarly even Year is also extremely large, therefore we shall be using Status as our sole Factor variable.
drop <- c("Country", "Year")
df <- df[,!(names(df) %in% drop)]
df$Status <- as.factor(df$Status)
Meanwhile, one major characteristic of countries is their GDP, a reflection of the extent to which countries produce wealth. However, totaL GDP does not account for population. A large population could have relatively low per-person output. However, one would expect GDP per capita to be a major influencing factor for the simple reason that rich countries should presumably be able to help maintain a healthier population than poor countries. Accordingly, we create a new variable per_cap_GDP to capture this possibility. Unfortunately, the values become tiny and may be subject to rounding errors. We therefore translate this into a log(per_cap_GDP).
df$per_cap_GDP <- df$GDP / df$Population # values too small
df$log_per_cap_GDP <- log(df$per_cap_GDP) # better scale
Now that we have finished cleaning our full data set, we will create out Training and Test data, utilizing the recommended 80-20 Training-Test split:
trn <- sort(sample(nrow(df), nrow(df) * 0.8))
train_df <- df[trn,]
test_df <- df[-trn,]
An obvious question to start out with is whether there is significant collinearity among the variables. And one would expect a pairs plot to give us quick insight into the matter. However, because there are so many variables, it is impossible to see anything if one applies "pairs" to all the numeric variables. Therefore, we have chosen to illustrate pair plots for categories of information. we have divided the data into four categories, namely: weight, disease, money, and lifestyle.
Money
money <- subset(train_df, select = c(Life.expectancy, percentage.expenditure, log_per_cap_GDP, Income.composition.of.resources, GDP))
pairs(money, col = "darkgreen", main = "Economic Factors and Life Expectancy")
Some of the pairs plots are a little unusual. For example log_per_cap_GDP vs Income.composition.of.resources exhibits one large blob of data on the right, and a completely separate vertical line on the left (the respective orientations are top and bottom if log_per_cap is on the x axis). It is as if some other interacting factor is causing there to be two types of relationships between these variables. Income.composition.of.resources exhibits the same divided behavior when plotted vs Life.expectancy.
If we check for correlations numerically, we find the following:
(cor_money <- round(cor(money), 2))
## Life.expectancy percentage.expenditure
## Life.expectancy 1.00 0.41
## percentage.expenditure 0.41 1.00
## log_per_cap_GDP 0.37 0.30
## Income.composition.of.resources 0.72 0.41
## GDP 0.45 0.95
## log_per_cap_GDP Income.composition.of.resources
## Life.expectancy 0.37 0.72
## percentage.expenditure 0.30 0.41
## log_per_cap_GDP 1.00 0.32
## Income.composition.of.resources 0.32 1.00
## GDP 0.36 0.46
## GDP
## Life.expectancy 0.45
## percentage.expenditure 0.95
## log_per_cap_GDP 0.36
## Income.composition.of.resources 0.46
## GDP 1.00
There are some highly correlated predictors, such as GDP and percentage.expenditure (the correlation is 0.95). Clearly one of those two variables should be eliminated. Since we have already expressed GDP in terms of a transformed variable (log_per_capita_GDP), it makes more sense to eliminate GDP.
Disease
diseases <- subset(train_df, select = c(Life.expectancy, Hepatitis.B, Measles, Polio, Diphtheria, HIV.AIDS))
pairs(diseases, col = "darkred", main = "Pair Plots for Diseases and Life Expectancy")
If we examine pair plots for variables related to disease, we see the same strange phenomenon of multiple relationships between a single pair of variables. For example, polio and diphtheria exhibit not one but three collinear behaviors. When plotted against hepatitis or life expectancy, both of those diseases show the same kind of "line on left, blob on right" relationship that we saw earlier for life expectancy vs BMI and other variable pairs.
Checking for correlations numerically, we find the following.
(cor_diseases <- round(cor(diseases), 2))
## Life.expectancy Hepatitis.B Measles Polio Diphtheria HIV.AIDS
## Life.expectancy 1.00 0.21 -0.08 0.33 0.35 -0.59
## Hepatitis.B 0.21 1.00 -0.14 0.48 0.61 -0.10
## Measles -0.08 -0.14 1.00 -0.07 -0.06 0.00
## Polio 0.33 0.48 -0.07 1.00 0.62 -0.11
## Diphtheria 0.35 0.61 -0.06 0.62 1.00 -0.13
## HIV.AIDS -0.59 -0.10 0.00 -0.11 -0.13 1.00
In this case, there are no obvious candidates for removal, as the correlations tend to be small, and in fact, frequently slightly negative.
As for the curious behavior where polio and diphtheria exhibit three collinear relationships that are governed by another interactive variable, it is unclear to us what might be causing that behvior. Three obvious candidates for interactive variables would be Status, Year, and log_per_cap_GDP, but none of these pan out.
par(mfrow = c(1,3))
colors <- as.numeric(as.factor(df$Status)) + 1
plot(df$Polio ~ df$Diphtheria, col = colors, xlab = "Polio", ylab = "Diphtheria",
main = "Colored by Status")
colors <- as.numeric(as.factor(df$Year)) + 1
plot(df$Polio ~ df$Diphtheria, col = colors, xlab = "Polio", ylab = "Diphtheria",
main = "Colored by Year")
colors <- as.numeric(trunc(df$log_per_cap_GDP)*-1) + 5 # to achieve positive values
plot(df$Polio ~ df$Diphtheria, col = colors, xlab = "Polio", ylab = "Diphtheria",
main = "Colored by Log Per-Cap GDP")
None of these plots show a clear one-to-one relationship between the color (indicating Status, Year, or log_per_capita_GDP) and the three linear trends having differing slopes. We leave this perplexing phenomenon unexplained, but are confident it will not play a major role in either our explanatory or predictive models below. As mentioned earlier, other pairs exhibited similar "dual-relationship" patterns (esp. thsoe with a line on left and blob on right). Explaining these complex relationships will remain outside the scope of this study.
Weight
weight <- subset(train_df, select = c(Life.expectancy, BMI, thinness..1.19.years, thinness.5.9.years))
pairs(weight, col = "darkblue", main = "Pair Plots for Weight Indices and Life Expectancy")
When plotted against life expectancy, BMI shows a vertical line on the left and a blob on the right. Like the case of Polio vs Diphtheria, this suggests that there is a hidden interactive variable--in this case, one that produces two separate (approximately) linear relationships.
There also appears to strong collinearity between the two thinness variables. We can check the degree of collinearity numerically as follows.
(cors_weight <- round(cor(weight), 2))
## Life.expectancy BMI thinness..1.19.years
## Life.expectancy 1.00 0.56 -0.47
## BMI 0.56 1.00 -0.55
## thinness..1.19.years -0.47 -0.55 1.00
## thinness.5.9.years -0.47 -0.55 0.92
## thinness.5.9.years
## Life.expectancy -0.47
## BMI -0.55
## thinness..1.19.years 0.92
## thinness.5.9.years 1.00
As expected, the correlation between thinness..1.19.years and thinness.5.9.years is extremely high: 0.92. One of these should be dropped from the weight model. Since 1-19 is more encompassing than 5-9, we choose to keep the former.
Lifestyle
Finally, there are some miscellaneous variables that do not really fall into one category. They cover a miscellany of topics about life and death--including alcohol (consumption), adult mortality, infant or children death, schooling, and population. One could call this category "lifestyle and death", but we will refer to it as "lifestyle" for short. As above, these variables are plotted against life expectancy.
lifestyle <- subset(train_df, select = c(Life.expectancy, Alcohol, Adult.Mortality, infant.deaths, under.five.deaths, Schooling, Population, Status))
pairs(lifestyle, col = "cadetblue4", main = "Lifestyle, Schooling, Population, and Death")
From the plots, we suspect nearly perfect collinearity between infant.deaths and under.five deaths. A check for correlations confirms that.
lifestyle_numeric <- subset(train_df, select = c(Life.expectancy, Alcohol, Adult.Mortality, infant.deaths, under.five.deaths, Schooling, Population))
(cor_lifestyle <- round(cor(lifestyle_numeric), 2))
## Life.expectancy Alcohol Adult.Mortality infant.deaths
## Life.expectancy 1.00 0.41 -0.71 -0.18
## Alcohol 0.41 1.00 -0.19 -0.10
## Adult.Mortality -0.71 -0.19 1.00 0.04
## infant.deaths -0.18 -0.10 0.04 1.00
## under.five.deaths -0.20 -0.10 0.05 1.00
## Schooling 0.73 0.61 -0.42 -0.22
## Population -0.03 -0.03 -0.02 0.70
## under.five.deaths Schooling Population
## Life.expectancy -0.20 0.73 -0.03
## Alcohol -0.10 0.61 -0.03
## Adult.Mortality 0.05 -0.42 -0.02
## infant.deaths 1.00 -0.22 0.70
## under.five.deaths 1.00 -0.23 0.69
## Schooling -0.23 1.00 -0.05
## Population 0.69 -0.05 1.00
Here we see some strong correlations between variables and life expectancy. For example, adult mortality has a strong negative correlation, and schooling has a strong positive correlation. But in terms of collinear predictors, nothing jumps out here, so for now we can keep these predictors within our scope of consideration.
We can create temporary models based on these themes just to get a rough idea if these categories of variables tend to have certain trends with respect to life expectancy.
(discols <- colnames(diseases))
## [1] "Life.expectancy" "Hepatitis.B" "Measles" "Polio"
## [5] "Diphtheria" "HIV.AIDS"
(moncols <- colnames(money))
## [1] "Life.expectancy" "percentage.expenditure"
## [3] "log_per_cap_GDP" "Income.composition.of.resources"
## [5] "GDP"
(wtcols <- colnames(weight))
## [1] "Life.expectancy" "BMI" "thinness..1.19.years"
## [4] "thinness.5.9.years"
(lscols <- colnames(lifestyle))
## [1] "Life.expectancy" "Alcohol" "Adult.Mortality"
## [4] "infant.deaths" "under.five.deaths" "Schooling"
## [7] "Population" "Status"
money_model <- lm(Life.expectancy ~ percentage.expenditure + log_per_cap_GDP + Income.composition.of.resources + GDP, data = train_df)
disease_model <- lm(Life.expectancy ~ Hepatitis.B + Measles + Polio + Diphtheria + HIV.AIDS, data = train_df)
weight_model <- lm(Life.expectancy ~ BMI + thinness..1.19.years + thinness.5.9.years, data = train_df)
lifestyle_model <- lm(Life.expectancy ~ Alcohol + Adult.Mortality + infant.deaths + under.five.deaths + Schooling + Status, data = train_df)
Money Model
(s_money <- summary(money_model))
##
## Call:
## lm(formula = Life.expectancy ~ percentage.expenditure + log_per_cap_GDP +
## Income.composition.of.resources + GDP, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.761 -2.848 0.578 3.251 26.906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.21e+01 7.90e-01 65.96 < 2e-16 ***
## percentage.expenditure 4.58e-04 3.11e-04 1.47 0.14
## log_per_cap_GDP 3.70e-01 5.44e-02 6.79 1.7e-11 ***
## Income.composition.of.resources 3.04e+01 1.01e+00 30.02 < 2e-16 ***
## GDP 1.62e-05 5.03e-05 0.32 0.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.91 on 1314 degrees of freedom
## Multiple R-squared: 0.558, Adjusted R-squared: 0.557
## F-statistic: 415 on 4 and 1314 DF, p-value: <2e-16
When we look at monetary factors, log-per_cap_GDP is highly significant, as is Income composition of resources. GDP and percentage.expenditure are not. We may want to eliminate them from further consideration.
Disease Model
(s_disease <- summary(disease_model))
##
## Call:
## lm(formula = Life.expectancy ~ Hepatitis.B + Measles + Polio +
## Diphtheria + HIV.AIDS, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.328 -4.936 0.498 4.225 17.775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.1682329 0.8187009 73.49 < 2e-16 ***
## Hepatitis.B -0.0165468 0.0090506 -1.83 0.0677 .
## Measles -0.0000511 0.0000177 -2.89 0.0039 **
## Polio 0.0648138 0.0104955 6.18 8.8e-10 ***
## Diphtheria 0.0809831 0.0117728 6.88 9.3e-12 ***
## HIV.AIDS -0.8244807 0.0311605 -26.46 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.63 on 1313 degrees of freedom
## Multiple R-squared: 0.444, Adjusted R-squared: 0.442
## F-statistic: 210 on 5 and 1313 DF, p-value: <2e-16
At a glance, it seems like Measles, Polio, Diphtheria, and AIDS all have highly significant effects on life expectancy. For some reason, Hepatitis.B does not have a statistically significant effect.
Weight Model
(s_weight <- summary(weight_model))
##
## Call:
## lm(formula = Life.expectancy ~ BMI + thinness..1.19.years + thinness.5.9.years,
## data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.265 -4.473 0.533 4.595 23.749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.2041 0.6684 96.06 <2e-16 ***
## BMI 0.1931 0.0121 16.00 <2e-16 ***
## thinness..1.19.years -0.2729 0.1116 -2.45 0.015 *
## thinness.5.9.years -0.1898 0.1104 -1.72 0.086 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.14 on 1315 degrees of freedom
## Multiple R-squared: 0.355, Adjusted R-squared: 0.353
## F-statistic: 241 on 3 and 1315 DF, p-value: <2e-16
Finally, it appears that thinness 1.19 years is an important factor, as is BMI, but thinness 5.9 years is not significant.
Lifestyle Model
(s_lifestyle <- summary(lifestyle_model))
##
## Call:
## lm(formula = Life.expectancy ~ Alcohol + Adult.Mortality + infant.deaths +
## under.five.deaths + Schooling + Status, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.87 -2.33 0.39 2.82 13.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.52329 0.95268 61.43 < 2e-16 ***
## Alcohol -0.03344 0.04406 -0.76 0.45
## Adult.Mortality -0.03222 0.00113 -28.54 < 2e-16 ***
## infant.deaths 0.09922 0.01289 7.70 2.7e-14 ***
## under.five.deaths -0.07600 0.00954 -7.97 3.5e-15 ***
## Schooling 1.49480 0.06276 23.82 < 2e-16 ***
## StatusDeveloping -1.86485 0.45235 -4.12 4.0e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.5 on 1312 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.743
## F-statistic: 636 on 6 and 1312 DF, p-value: <2e-16
Out of the lifestyle variables, it appears that Alcohol is a weak predictor, with a high p-value. We may want to jettison that variable. Also, as we noted earlier, infant.deaths and under.five.deaths are practically the same thing. "Under five"" seems to be more inclusive (it may include infants), so we discard infant.deaths.
So far it looks as though we should eliminate percentage.expenditure and GDP from the money model; Hepatitis from the disease model, thinness.5.9.years from the weight model, and Alcohol and infant.deaths from the lifestyle model.
Let's create new models without these unhelpful predictors and see if they end up being better.
Money
money_reduced <- lm(Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources, data = train_df)
s_money_reduced <- summary(money_reduced)
orig_r2 <- s_money$adj.r.squared
red_r2 <- s_money_reduced$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Original Model | 0.5566 | 4 |
| Reduced Model | 0.5473 | 2 |
Here we can see there is a 1.6667% decrease in Adjusted \(R^2\) when reducing the size of the model by 2 predictors.
Disease
disease_reduced <- lm(Life.expectancy ~ Measles + Polio + Diphtheria + HIV.AIDS, data = train_df)
s_disease_reduced <- summary(disease_reduced)
orig_r2 <- s_disease$adj.r.squared
red_r2 <- s_disease_reduced$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Original Model | 0.4422 | 5 |
| Reduced Model | 0.4412 | 4 |
Here we can see there is a 0.2249% decrease in Adjusted \(R^2\) when dereasing the size of the model by 1 predictor.
Weight
weight_reduced <- lm(Life.expectancy ~ BMI + thinness..1.19.years, data = train_df)
s_weight_reduced <- summary(weight_reduced)
orig_r2 <- s_weight$adj.r.squared
red_r2 <- s_weight_reduced$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Original Model | 0.3533 | 3 |
| Reduced Model | 0.3524 | 2 |
Downsizing the weight model by 1 predictor results in a 0.2718% decrease in Adjusted \(R^2\) .
Lifestyle
lifestyle_reduced <- lm(Life.expectancy ~ Adult.Mortality + under.five.deaths + Schooling + Population + Status, data = train_df)
s_lifestyle_reduced <- summary(lifestyle_reduced)
orig_r2 <- s_lifestyle$adj.r.squared
red_r2 <- s_lifestyle_reduced$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Original Model | 0.7431 | 6 |
| Reduced Model | 0.732 | 5 |
Here we can see there is a 1.4886% decrease in Adjusted \(R^2\) and while decreasing the size of the model by 1 predictor.
In all four cases, when assessed using adjusted R-squared, the reduced models seem to suffer very little with respect to their corresponding fuller models. We appear to be justified in using the smaller models, which we will be using below.
Breusch-Pagan Test for Heteroscedasticity
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bp_money_red <- bptest(money_reduced)$p.value
bp_disease_red <-bptest(disease_reduced)$p.value
bp_weight_red <-bptest(weight_reduced)$p.value
bp_lifestyle_red <-bptest(lifestyle_reduced)$p.value
| Reduced Model | Breusch-Pagan P-Value |
|---|---|
| Money | 2.200210^{-83} |
| Disease | 1.949810^{-12} |
| Weight | 2.171810^{-38} |
| Lifestyle | 1.568810^{-23} |
Seeing that all of our p-values are extremely low for the Breusch-Pagan Test, we have to assume that all our models fail the Equal Variance Assumption. We can confirm this by looking at the Residual-Fitted graphs.
par(mfrow = c(2,2))
#Money Model
plot(fitted(money_reduced), resid(money_reduced), col = "darkgreen",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Money Model")
abline(h = 0, lwd = 3, col = "goldenrod")
#Disease Model
plot(fitted(disease_reduced), resid(disease_reduced), col = "darkred",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Disease Model")
abline(h = 0, lwd = 3, col = "black")
#Weight Model
plot(fitted(weight_reduced), resid(weight_reduced), col = "darkblue",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Weight Model")
abline(h = 0, lwd = 3, col = "purple")
#Lifestyle Model
plot(fitted(lifestyle_reduced), resid(lifestyle_reduced), col = "cadetblue4",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Lifestyle Model")
abline(h = 0, lwd = 3, col = "brown4")
Observations:
Shapiro-Wilks Test
| Reduced Model | Shapiro-Wilks P-Value |
|---|---|
| Money | 1.489210^{-22} |
| Disease | 0.0007 |
| Weight | 1.746210^{-8} |
| Lifestyle | 2.496510^{-25} |
Seeing that all of our p-values are extremely low for the Shapiro-Wilks Test as well, we have to assume that all our models fail the Normality Assumption. We can check the QQ plots to be sure.
par(mfrow = c(2,2))
#Money Model
qqnorm(resid(money_reduced), main = "Reduced Money Model", col = "darkgreen")
qqline(resid(money_reduced))
#Disease Model
qqnorm(resid(disease_reduced), main = "Reduced Disease Model", col = "darkred")
qqline(resid(disease_reduced))
#Weight Model
qqnorm(resid(weight_reduced), main = "Reduced Weight Model", col = "darkblue")
qqline(resid(weight_reduced))
#Lifestyle Model
qqnorm(resid(lifestyle_reduced), main = "Reduced Lifestyle Model", col = "cadetblue4")
qqline(resid(lifestyle_reduced))
Observations:
Variance Inflation Factor
car::vif(money_reduced)
## log_per_cap_GDP Income.composition.of.resources
## 1.113 1.113
car::vif(disease_reduced)
## Measles Polio Diphtheria HIV.AIDS
## 1.005 1.620 1.627 1.020
car::vif(weight_reduced)
## BMI thinness..1.19.years
## 1.434 1.434
car::vif(lifestyle_reduced)
## Adult.Mortality under.five.deaths Schooling Population
## 1.229 2.061 1.630 1.960
## Status
## 1.371
All of these values look quite good. By eliminating highly correlated variables, we seem to have eliminated collinearity within these sub-models, though by performing transformations, we may possibly get even better values.
Currently our reduced models are better than the original ones, but they still have some issues with the Equal Variance and Normality assumptions. By performing some transformations on the variables however, we can improve on them significantly. (All transformations take into account that outlier data is not being removed.)
All variable transformation functions were decided by looking at the pairs plot between Life.expectancy and that particular value as well as experimenting with different functions (log, exp, ^2, ^3, 1/x, etc) while increasing the \(Adj. R^2\) value (if possible) and keeping size of the model relatively low.
Money
money_trans <- lm(Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3), data = train_df)
s_money_trans <- summary(money_trans)
red_r2 <- s_money_reduced$adj.r.squared
trans_r2 <- s_money_trans$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Reduced Model | 0.5473 | 2 |
| Transformed Model | 0.7355 | 4 |
By adding 2 predictors, the transformed money model achieves a 34.3691% increase in Adjusted \(R^2\) .
Disease
disease_trans <- lm(Life.expectancy ~ Measles + Polio + Diphtheria + HIV.AIDS + log(HIV.AIDS), data = train_df)
s_disease_trans <- summary(disease_trans)
red_r2 <- s_disease_reduced$adj.r.squared
trans_r2 <- s_disease_trans$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Reduced Model | 0.4412 | 4 |
| Transformed Model | 0.6682 | 5 |
With the transformed disease model, we achieve an 51.4742% increase in Adjusted \(R^2\) while increasing the size of the model by only 1 predictor. This is a huge improvement.
Weight
weight_trans <- lm(Life.expectancy ~ BMI + thinness..1.19.years + exp(thinness..1.19.years), data = train_df)
s_weight_trans <- summary(weight_trans)
red_r2 <- s_weight_reduced$adj.r.squared
trans_r2 <- s_weight_trans$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Reduced Model | 0.3524 | 2 |
| Transformed Model | 0.3638 | 3 |
The transformed weight model achieves a 3.2397% increase in Adjusted \(R^2\) by adding 1 predictor. The increase in Adjusted \(R^2\) is so small that it might be better to stick with the reduced model.
Lifestyle
lifestyle_trans <- lm(Life.expectancy ~ Adult.Mortality + under.five.deaths + Schooling + Population + Status + sqrt(Schooling) , data = train_df)
s_lifestyle_trans <- summary(lifestyle_trans)
red_r2 <- s_lifestyle_reduced$adj.r.squared
trans_r2 <- s_lifestyle_trans$adj.r.squared
| Adj. \(R^2\) | Predictors | |
|---|---|---|
| Reduced Model | 0.732 | 5 |
| Transformed Model | 0.7319 | 6 |
Here we can see there is a -0.0191% decrease in Adjusted \(R^2\) and while increasing the size of the model by 1 predictors. It would definitely be better to stick with the reduced in this case.
We have the Adjusted \(R^2\) for all the new models using the transformed variables, so we have a decent idea of which reduced we will be replacing with their transformed counterparts, and which will be staying the same. However, we should check to see how our transformed models treat our linear model assumptions
Breusch-Pagan Test for Heteroscedasticity
library(lmtest)
bp_money_trans <- bptest(money_trans)$p.value
bp_disease_trans <-bptest(disease_trans)$p.value
bp_weight_trans <-bptest(weight_trans)$p.value
bp_lifestyle_trans <-bptest(lifestyle_trans)$p.value
| Model | BP P-Val (Reduced) | BP P-Val (Transformed) |
|---|---|---|
| Money | 2.200210^{-83} | 2.962110^{-21} |
| Disease | 1.949810^{-12} | 4.178410^{-5} |
| Weight | 2.171810^{-38} | 1.252710^{-37} |
| Lifestyle | 1.568810^{-23} | 7.34310^{-23} |
We can see that there is a large improvement for the money and disease models, whereas there are much smaller improvements for weight and lifestyle Unfortunately, all the models appear to still be failing the Equal Variance assumption, which we can confirm by looking at the new Fitted_Residual Graphs.
par(mfrow = c(1,2))
#Reduced Money Model
plot(fitted(money_reduced), resid(money_reduced), col = "darkgreen",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Money Model")
abline(h = 0, lwd = 3, col = "goldenrod")
#Transformed Money Model
plot(fitted(money_trans), resid(money_trans), col = "darkgreen",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Transformed Money Model")
abline(h = 0, lwd = 3, col = "goldenrod")
par(mfrow = c(1,2))
#Reduced Disease Model
plot(fitted(disease_reduced), resid(disease_reduced), col = "darkred",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Disease Model")
abline(h = 0, lwd = 3, col = "black")
#Transformed Disease Model
plot(fitted(disease_trans), resid(disease_trans), col = "darkred",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Transformed Disease Model")
abline(h = 0, lwd = 3, col = "black")
par(mfrow = c(1,2))
#Reduced Weight Model
plot(fitted(weight_reduced), resid(weight_reduced), col = "darkblue",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Weight Model")
abline(h = 0, lwd = 3, col = "purple")
#Transformed Weight Model
plot(fitted(weight_trans), resid(weight_trans), col = "darkblue",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Transformed Weight Model")
abline(h = 0, lwd = 3, col = "purple")
par(mfrow = c(1,2))
#Reduced Lifestyle Model
plot(fitted(lifestyle_reduced), resid(lifestyle_reduced), col = "cadetblue4",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Reduced Lifestyle Model")
abline(h = 0, lwd = 3, col = "brown4")
#Transformed Lifestyle Model
plot(fitted(lifestyle_trans), resid(lifestyle_trans), col = "cadetblue4",
ylab = "Residuals",
xlab = "Fitted Values",
main = "Transformed Lifestyle Model")
abline(h = 0, lwd = 3, col = "brown4")
Observations:
Our graphs confirm what our results of our BP tests. The money model and disease model both improved greatly, while the weight and lifestyle models stayed basically the same.
We also want to see if there has been any improvement in the Normality Assumptions so we shall run the Shapiro-Wilks test and compare.
Shapiro-Wilks Test
| Reduced Model | (Reduced) Shapiro-Wilks P-Value | (Transformed) Shapiro-Wilks P-Value |
|---|---|---|
| Money | 1.489210^{-22} | 7.638710^{-15} |
| Disease | 0.0007 | 2.563710^{-5} |
| Weight | 1.746210^{-8} | 1.781910^{-8} |
| Lifestyle | 2.496510^{-25} | 2.641310^{-25} |
These values are strange, it appears that the only model who's P-value improved was the money model. The weight and lifestyle models stayed relatively the same while the disease model's p-value actually got worse. Let's look at the graphs while comparing them to the reduced models.
par(mfrow = c(1,2))
#Reduced Money Model
qqnorm(resid(money_reduced), main = "Reduced Money Model", col = "darkgreen")
qqline(resid(money_reduced))
#Transformed Money Model
qqnorm(resid(money_trans), main = "Transformed Money Model", col = "darkgreen")
qqline(resid(money_trans))
As expected, there is a much better improvement in the upper quantile for the money model.
par(mfrow = c(1,2))
#Reduced Disease Model
qqnorm(resid(disease_reduced), main = "Reduced Disease Model", col = "darkred")
qqline(resid(disease_reduced))
#Transformed Disease Model
qqnorm(resid(disease_trans), main = "Transformed Disease Model", col = "darkred")
qqline(resid(disease_trans))
At first glance, there appears to be a slight deterioration in both the upper and lower quantiles of the new graph. But if one considers the scales of the Y-axis, one can see that the new model might actually be an improvement.
par(mfrow = c(1,2))
#Reduced Weight Model
qqnorm(resid(weight_reduced), main = "Reduced Weight Model", col = "darkblue")
qqline(resid(weight_reduced))
#Transformed Weight Model
qqnorm(resid(weight_trans), main = "Transformed Weight Model", col = "darkblue")
qqline(resid(weight_trans))
The weight graph appears to have not really changed at all, which corresponds to the new SW p-value.
par(mfrow = c(1,2))
#Reduced Lifestyle Model
qqnorm(resid(lifestyle_reduced), main = "Reduced Lifestyle Model", col = "cadetblue4")
qqline(resid(lifestyle_reduced))
#Transformed Lifestyle Model
qqnorm(resid(lifestyle_trans), main = "Transformed Lifestyle Model", col = "cadetblue4")
qqline(resid(lifestyle_trans))
There is a very slight improvement in the upper quartile of the lifestyle graph, while the lower quartile remains the same.
ANOVA Test
Since all the Reduced Models are nested within the Transformed models, we can also run ANOVA tests to see if there is a significant improvement in the transformed models.
| Model | Reduced vs Transformed P-Value |
|---|---|
| Money | 2.033310^{-154} |
| Disease | 4.403210^{-151} |
| Weight | 7.920110^{-7} |
| Lifestyle | 0.5753 |
Preferred Models out of the Reduced and Models
After comparing all these different tests--namely, ANOVA, SW-Test, BP-Test--and taking in account the change in number of parameters and change in Adjusted \(R^2\), we believe that we should continue forward with the following models:
Thus far, we have identified promising variables in each of several categories (themes), and we have also identified a few useful transformations. The next step is to try to create a model that combines data from all four categories of variables into one larger model. There are a number of ways to do this. We have chosen to keep all the variables we have identified as "good" within the scope, and use bi-directional stepping with AIC and BIC.
Below, we employ the transformed money model as the starter, though, in principle, any of the above four selected models should work equally well.
AIC
starter <- money_trans
aic_model <- step(starter,
scope = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + Measles + Polio + Diphtheria + HIV.AIDS + log(HIV.AIDS) + BMI + thinness..1.19.years + Adult.Mortality + under.five.deaths + Schooling + Population + Status,
direction = "both",
trace = 0)
summary(aic_model)
##
## Call:
## lm(formula = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources +
## I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) +
## log(HIV.AIDS) + Adult.Mortality + HIV.AIDS + under.five.deaths +
## Population + BMI + Diphtheria, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.836 -1.697 -0.017 1.807 10.573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.60e+01 7.06e-01 93.53 < 2e-16 ***
## log_per_cap_GDP 6.80e-02 3.11e-02 2.18 0.02908 *
## Income.composition.of.resources -4.01e+01 4.48e+00 -8.94 < 2e-16 ***
## I(Income.composition.of.resources^2) 9.72e+01 1.11e+01 8.79 < 2e-16 ***
## I(Income.composition.of.resources^3) -4.17e+01 7.40e+00 -5.64 2.1e-08 ***
## log(HIV.AIDS) -1.17e+00 1.07e-01 -10.92 < 2e-16 ***
## Adult.Mortality -1.30e-02 9.61e-04 -13.52 < 2e-16 ***
## HIV.AIDS -2.15e-01 2.30e-02 -9.33 < 2e-16 ***
## under.five.deaths -2.64e-03 7.43e-04 -3.56 0.00039 ***
## Population 3.90e-09 1.82e-09 2.15 0.03208 *
## BMI 1.29e-02 5.73e-03 2.25 0.02484 *
## Diphtheria 7.83e-03 4.33e-03 1.81 0.07069 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.15 on 1307 degrees of freedom
## Multiple R-squared: 0.875, Adjusted R-squared: 0.874
## F-statistic: 833 on 11 and 1307 DF, p-value: <2e-16
BIC
n <- nrow(train_df)
bic_model <- step(starter, k = log(n),
scope = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + Measles + Polio + Diphtheria + HIV.AIDS + log(HIV.AIDS) + BMI + thinness..1.19.years + Adult.Mortality + under.five.deaths + Schooling + Population + Status,
direction = "both",
trace = 0)
summary(bic_model)
##
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources +
## I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) +
## log(HIV.AIDS) + Adult.Mortality + HIV.AIDS + under.five.deaths,
## data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.944 -1.736 -0.031 1.812 10.911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.555779 0.590117 112.78 < 2e-16 ***
## Income.composition.of.resources -44.468727 4.346403 -10.23 < 2e-16 ***
## I(Income.composition.of.resources^2) 108.625382 10.685279 10.17 < 2e-16 ***
## I(Income.composition.of.resources^3) -48.185565 7.230695 -6.66 3.9e-11 ***
## log(HIV.AIDS) -1.203883 0.107116 -11.24 < 2e-16 ***
## Adult.Mortality -0.013152 0.000962 -13.67 < 2e-16 ***
## HIV.AIDS -0.206210 0.023001 -8.97 < 2e-16 ***
## under.five.deaths -0.002080 0.000525 -3.96 7.9e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.16 on 1311 degrees of freedom
## Multiple R-squared: 0.874, Adjusted R-squared: 0.873
## F-statistic: 1.29e+03 on 7 and 1311 DF, p-value: <2e-16
not_in_BIC <- names(coef(aic_model))[!(names(coef(aic_model)) %in% names(coef(bic_model)))]
The AIC model is larger, containing all the same predictors as BIC as well as log_per_cap_GDP, Population, BMI, Diphtheria. This makes sense, as BIC penalizes larger models more heavily and tends to result in fewer predictors. Let's compare the models using an ANOVA test to see whether the AIC model is significantly better.
ANOVA P-val: 0.0037
This p-val suggests that we should move forward with the AIC model. To confirm that it is the best model we have, we also compare it with the four category models we selected earlier, as well as a new full additive model that contains all the variables of our data frame (except for the factor variables) as well as any transformed variables from our smaller models.
full_model <- lm(Life.expectancy ~ . - GDP - Population + I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) + log(HIV.AIDS), data = train_df) #We are also removing the GDP and Population Params as we will use the log_per_cap_GDP instead
summary(full_model)
##
## Call:
## lm(formula = Life.expectancy ~ . - GDP - Population + I(Income.composition.of.resources^2) +
## I(Income.composition.of.resources^3) + log(HIV.AIDS), data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.103 -1.726 -0.031 1.655 11.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.80e+01 1.14e+00 59.93 < 2e-16 ***
## StatusDeveloping -5.45e-01 3.49e-01 -1.56 0.11841
## Adult.Mortality -1.17e-02 9.46e-04 -12.33 < 2e-16 ***
## infant.deaths 4.68e-02 9.68e-03 4.84 1.5e-06 ***
## Alcohol -1.59e-01 3.24e-02 -4.90 1.1e-06 ***
## percentage.expenditure 2.28e-04 6.45e-05 3.53 0.00042 ***
## Hepatitis.B -5.25e-03 4.33e-03 -1.21 0.22507
## Measles -2.78e-06 9.86e-06 -0.28 0.77820
## BMI 8.13e-03 5.99e-03 1.36 0.17507
## under.five.deaths -3.53e-02 7.07e-03 -4.99 6.9e-07 ***
## Polio -1.59e-03 4.96e-03 -0.32 0.74901
## Total.expenditure 1.75e-01 4.04e-02 4.33 1.6e-05 ***
## Diphtheria 9.72e-03 5.62e-03 1.73 0.08427 .
## HIV.AIDS -2.35e-01 2.30e-02 -10.23 < 2e-16 ***
## thinness..1.19.years 5.53e-02 4.92e-02 1.12 0.26124
## thinness.5.9.years -1.18e-01 4.87e-02 -2.43 0.01506 *
## Income.composition.of.resources -5.57e+01 5.21e+00 -10.70 < 2e-16 ***
## Schooling -1.10e-01 7.39e-02 -1.49 0.13724
## per_cap_GDP 3.02e-02 2.28e-02 1.33 0.18529
## log_per_cap_GDP 1.12e-02 3.12e-02 0.36 0.72037
## I(Income.composition.of.resources^2) 1.38e+02 1.33e+01 10.42 < 2e-16 ***
## I(Income.composition.of.resources^3) -6.75e+01 8.88e+00 -7.60 5.8e-14 ***
## log(HIV.AIDS) -9.20e-01 1.11e-01 -8.32 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.06 on 1296 degrees of freedom
## Multiple R-squared: 0.883, Adjusted R-squared: 0.881
## F-statistic: 446 on 22 and 1296 DF, p-value: <2e-16
full_r2 <- summary(full_model)$adj.r.squared
aic_r2 <- summary(aic_model)$adj.r.squared
#This function will be used to calculate the RMSE of our models
rmse = function(model, training = TRUE){
if(training){
y = train_df$Life.expectancy
y_hat = predict(model,newdata = train_df)
n = nrow(train_df)
} else{
y = test_df$Life.expectancy
y_hat = predict(model,newdata = test_df)
n = nrow(test_df)
}
sqrt(sum((y - y_hat) ^ 2) / n)
}
| AIC Model vs | ANOVA P-val |
|---|---|
| Transformed Money | 4.450210^{-207} |
| Transformed Disease | 5.168710^{-272} |
| Reduced Weight | 0 |
| Reduced Lifestyle | 1.51710^{-211} |
| Full | 4.324810^{-14} |
According to these results, our AIC model is better than all of the models except for the Full model. However, when we look at the adjusted R-squared values:
| Model | Adjusted \(R^2\) | No. Of Predictors |
|---|---|---|
| AIC | 0.8741 | 11 |
| Full | 0.8813 | 22 |
we can see that there is a 0.8272% increase in Adjusted \(R^2\) while the model size increases by 11 predictors. It would definitely be better to stick with the AIC Model. Let's now confirm these results using test data.
So far we have not considered the possibility of overfitting. It would make sense, therefore, to validate our preliminary conclusions using test datasets to see how these models behave.
| Model | RMSE (train) | RMSE (test) | No of Predictors |
|---|---|---|---|
| AIC Model | 3.1354 | 3.1677 | 11 |
| Full Model | 3.0312 | 3.0279 | 22 |
Looking at the RMSE we can see that the full model has a lower RMSE for both the train and test data, but it also has almost twice as many predictors. Since smaller models are generally preferable and the improvement is not that large, we select the AIC-derived combined model as our final choice.
bp_aic <- bptest(aic_model)$p.value
sw_aic <- shapiro.test(resid(aic_model))$p.value
adj_r2 <- summary(aic_model)$adj.r.squared
summary(aic_model)
##
## Call:
## lm(formula = Life.expectancy ~ log_per_cap_GDP + Income.composition.of.resources +
## I(Income.composition.of.resources^2) + I(Income.composition.of.resources^3) +
## log(HIV.AIDS) + Adult.Mortality + HIV.AIDS + under.five.deaths +
## Population + BMI + Diphtheria, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.836 -1.697 -0.017 1.807 10.573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.60e+01 7.06e-01 93.53 < 2e-16 ***
## log_per_cap_GDP 6.80e-02 3.11e-02 2.18 0.02908 *
## Income.composition.of.resources -4.01e+01 4.48e+00 -8.94 < 2e-16 ***
## I(Income.composition.of.resources^2) 9.72e+01 1.11e+01 8.79 < 2e-16 ***
## I(Income.composition.of.resources^3) -4.17e+01 7.40e+00 -5.64 2.1e-08 ***
## log(HIV.AIDS) -1.17e+00 1.07e-01 -10.92 < 2e-16 ***
## Adult.Mortality -1.30e-02 9.61e-04 -13.52 < 2e-16 ***
## HIV.AIDS -2.15e-01 2.30e-02 -9.33 < 2e-16 ***
## under.five.deaths -2.64e-03 7.43e-04 -3.56 0.00039 ***
## Population 3.90e-09 1.82e-09 2.15 0.03208 *
## BMI 1.29e-02 5.73e-03 2.25 0.02484 *
## Diphtheria 7.83e-03 4.33e-03 1.81 0.07069 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.15 on 1307 degrees of freedom
## Multiple R-squared: 0.875, Adjusted R-squared: 0.874
## F-statistic: 833 on 11 and 1307 DF, p-value: <2e-16
Our final model has the following properties:
| No. of Predictors | 11 |
| BP-Test P Val | 9.885410^{-6} |
| SW-Test P Val | 7.150810^{-11} |
| Adjusted $R^2 | 0.8741 |
plot(fitted(aic_model), resid(aic_model), col = "purple",
ylab = "Residuals",
xlab = "Fitted Values",
main = "AIC-Derived Combined Model")
abline(h = 0, lwd = 3, col = "darkblue")
qqnorm(resid(aic_model), main = "AIC-Derived Combined Model", col = "purple")
qqline(resid(aic_model))
Our final model has a high \(R^2\) value, which is good, but still has some issues with normality assumptions, especially in the lower quartiles. This makes sense because in the intial pair plots, many of the graphs had a clusters of data points in the lower ranges. By removing these outliers, we could have improved our model's BP and SW test results. However, without understanding why those data points were so separate from the rest of the data, we deemed that it would be unwise to remove them, and decided to work with them instead.
We also noticed that many pair plots seemed to point to complex collinear relationships between pairs of variables, where more than one linear relationships could be identified simultaneously. We briefly tried to identify possible interactions, but failed to find a variable that might explain such behaviors.
Overall, breaking a large number of variables into thematic categories proved useful in quickly eliminating collinear variables that had equivalent effects on life expectancy. We were able to combine these models into a larger that took all categories of variables into account. Ultimately, however, less-intuitive but ultimately more rewarding transformations discovered through trial and error were needed to produce a better predictive model.
Group Members:
+ Warren Child
+ Zoheb Satta
+ Yoga Mahalingam